[Core] Dynamic image size support for VLMs #5276
Overall LGTM - I'll just need to run some testing on my end before finally approving this!
LGTM! Thank you for the work and glad we resolved all the issues!
Signed-off-by: Xiaowei Jiang <[email protected]> Co-authored-by: Xiaowei Jiang <[email protected]> Co-authored-by: ywang96 <[email protected]> Co-authored-by: xwjiang2010 <[email protected]> Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: Xiaowei Jiang <[email protected]> Co-authored-by: Xiaowei Jiang <[email protected]> Co-authored-by: ywang96 <[email protected]> Co-authored-by: xwjiang2010 <[email protected]> Co-authored-by: Roger Wang <[email protected]> Signed-off-by: Alvant <[email protected]>
This PR uses the input registry introduced by #5214 to implement an input processor that inserts image tokens automatically at the `LLMEngine` level, so that it applies to `LLM.generate`.
Accordingly, I have updated LLaVA-NeXT and Phi-3-Vision to support dynamic image size. Along the way, I have expanded the VLM tests to cover text-only and multiscale-image input in addition to the existing single-scale image input.
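To illustrate the idea of automatic image token insertion with dynamic image sizes, here is a minimal, self-contained sketch. It is not the code from this PR: the token ID, the patch size, the feature-count formula, and the function names are all hypothetical, and real models compute the number of image features in model-specific ways.

```python
from typing import Iterable, List, Tuple

IMAGE_TOKEN_ID = 32000  # hypothetical placeholder token ID
PATCH_SIZE = 14         # hypothetical vision-encoder patch size


def num_image_features(width: int, height: int,
                       patch_size: int = PATCH_SIZE) -> int:
    """Hypothetical feature count: one token per full patch in the image."""
    return (width // patch_size) * (height // patch_size)


def expand_image_tokens(token_ids: List[int],
                        image_sizes: Iterable[Tuple[int, int]],
                        image_token_id: int = IMAGE_TOKEN_ID) -> List[int]:
    """Replace each single placeholder with as many copies as its image needs."""
    sizes = iter(image_sizes)
    expanded: List[int] = []
    for token in token_ids:
        if token != image_token_id:
            expanded.append(token)
            continue
        size = next(sizes, None)
        if size is None:
            raise ValueError("more image placeholders than images")
        width, height = size
        expanded.extend([image_token_id] * num_image_features(width, height))
    if next(sizes, None) is not None:
        raise ValueError("more images than image placeholders")
    return expanded
```

Because the expansion depends on each image's size, a text-only prompt passes through unchanged, and two images of different sizes yield different numbers of image tokens.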
Based on this, I have written a detailed guide on how to implement multimodal vLLM models.
Please note that this introduces a breaking change for users: instead of manually repeating image tokens in the prompt, use the same prompt format as described in the corresponding HuggingFace repo, regardless of the model.
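As a rough illustration of the prompt change, the sketch below contrasts the two styles. The `<image>` placeholder string, the feature size of 576, and the `USER:`/`ASSISTANT:` template are assumptions chosen for illustration; the exact format for a given model should be taken from its HuggingFace repo.

```python
IMAGE_TOKEN = "<image>"  # assumed placeholder string
FEATURE_SIZE = 576       # assumed fixed image feature size (illustrative)

question = "What is the content of this image?"

# Old style: the user repeats the image token once per image feature.
old_prompt = IMAGE_TOKEN * FEATURE_SIZE + f"\nUSER: {question}\nASSISTANT:"

# New style: a single placeholder, written as in the model's own template;
# the engine is then responsible for expanding it.
new_prompt = f"USER: {IMAGE_TOKEN}\n{question}\nASSISTANT:"
```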
Related contributions
Follow-up to #5214.
This PR conflicts with #5237, which inserts image tokens at the `OpenAIServing` level. This PR has removed that logic from the server to avoid double insertion.